1 Introduction

I provide two minimal examples to compare discrete and continuous smoothing to figure out the differences. The discrete will be performed by fitting a Beta regression using a GAM with Gaussian Markov Random Fields (GMRF) as basis. The MRF are the discrete counterpart of Gaussian processes (GP).

The residuals are the geomtric ones: \(\mathrm{res} = \mathrm{freq}_{\mathrm{obs}}/\mathrm{freq}_{\mathrm{glm}}\) meaning that we really consider the Poisson distribution as valid. If \(\mathrm{res} <1\) the model overestimates the claim frequency and vice versa.

Summarizing what I’ve done in this notebook:

Moreover, I did the same pipeline on the PC3CODE and PC2CODE levels of coarse graining. I’ve done: - Use the aggregated version of the raw GLM residuals since I only have the raw residuals at the PC4CODE level. So it’s not 100% valid but it provides intuition of what would be the results. - Merge the polygons and the shape files at those two levels - Re-fit the GAM with GMRF basis - Plot the results, smoothed and raw residuals on the training set

The run is pretty fast, the entire file compile in +-15 min (can be re-written avoiding doing multiple times the same operation, will be then much faster). The most expensive part being the plotting chunks. The results (at least for the smoothing part) should be quite similar to the ICAR. The most discriminative part should be the clustering part since quantiles and hierarchical clustering are pretty different. The hierarchical clustering has the advantage to group similar areas (in terms of smoothed residuals) pairwise and build the structure. This means that zone 1 is more similar to zone 2 than to zone 3 (similarity order).

The quantiles will provide an uniform distribution among zones but two areas within a single zone might therefore be significantly different.

2 Discrete and continuous spatial smoothing at PC4CODE level

2.1 Discrete smoothing: GAM with GMRF basis smoothing

At the first stage I didn’t have a val set (only train and test already aggregated at the PC4 level). Ideally, the roughness parameter (hyper param of the GAM+GMRF) should be chosen by evalutation on the validation set. I set it by hand (as a guess value).

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## 
## Family: Beta regression(1.804) 
## Link function: logit 
## 
## Formula:
## geom_res_freq ~ s(PC4CODE, bs = "mrf", k = roughness_param, xt = list(nb = nb))
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.2372     0.0171  -130.9   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df Chi.sq p-value    
## s(PC4CODE) 123.5  143.2   2474  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  -0.191   Deviance explained = -25.9%
## -REML = -22254  Scale est. = 1         n = 3952

However since we are speaking of residuals, I highly recommend a diverging colour palette. The differentiation between less or greater than 1 will be easier and the results better illustrated. Here is the rendering (the 0 are coded in red):

Please, forget even the existence of the jet colour map. They are so many papers on this but a picture is worth a thousand words:

One cannot distinguish anything using the jet colour map. Information that we hardly extracted from the data are hidden and non interpretable (no boundaries between positive and negative, cluster are fuzzy, non uniform and misleading colour intensity, etc.). If for an obscure reason, one really really wants a rainbow like colour map, at least one should use a uniformized rainbow map like the one provided by https://peterkovesi.com/projects/colourmaps/

2.2 Continuous smoothing: GAM with GP basis smoothing

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## geom_res_freq ~ s(XCOORD, YCOORD, bs = "gp", k = roughness_param, 
##     m = c(1, 0.175))
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.89451    0.01177   75.97   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                  edf Ref.df     F p-value    
## s(XCOORD,YCOORD)   2      2 45.33  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.0219   Deviance explained = 2.24%
## GCV = 0.54832  Scale est. = 0.54791   n = 3952
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

The thing is that GP are flexible but quite smooth, too smooth for the present application, besides continuous instead of discrete.

2.3 Hierarchical clustering of the smoothed residuals

Note that clustering the smoothed residuals, meaning \(\mathrm{res}_\mathrm{smooth}\) instead of \(\mathrm{res}_\mathrm{gml}\), will return a much more homogeneous spatial distribution.

Table continues below
1 2 3 4 5 6 7 8 9 10 11 12 13 14
95 206 243 205 107 178 266 69 90 122 312 202 230 221
15 16 17 18 19 20
60 77 244 237 393 395

Probably too many zones.

1 2 3 4 5 6 7 8 9 10
292 449 383 468 206 122 312 451 244 1025
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

The zones are illustrated hereafter in qualitative colours. The zones are the levels of a categorical variable, the way this variable was built discretizing the smoothed residuals and ordered the cluster (hierarchy). Therefore it also makes sense to illustrate the zones in sequential colours. I provide this illustration below the qualitative colours chart.

The zones in sequential colours

2.4 Comparison to the SAS implemented method (weighted average and credibility)

As Allsecur did 30 zones, let’s re-cut the hierarchical tree and plot the zones for comparison.

## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

and comparing the zones. Using the hierarchical clustering, several zones will be almost empty

Table continues below
1 2 3 4 5 6 7 8 9 10 11 12 13 14
95 125 124 205 81 107 178 153 113 119 10 90 64 176
15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30
103 136 230 106 99 58 60 77 59 115 92 152 237 311 395 82

In the SAS procedure, the distribution is

Table continues below
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
6 3 5 4 3 3 5 13 25 73 148 234 237 246 289 281
17 18 19 20 21 22 23 24 25 26 27 28 29 30
293 391 478 473 316 177 123 89 53 36 17 10 4 9

In qualitative colours:

2.5 Prediction on the test set

Note that since NAs have been encountered in the GMRF fitting (no observation for some zip code in the training set), one cannot predict a value for those zip code. Indeed they have been removed before the lattice setting because the lattice cannot be defined based on NAs. If you want to be able to predict for all zip code, make imputation or be sure to have at least one observation per zip code or reduce the granularity.

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

Let’s plot the prediction and the target values

3 Smoothing comparison of the smoothing at the PC4 and PC3 levels

Just to given an idea, I provide here the equivalent at a more coarse grained level: the PC3 zip code. To do that, I’ll take the aggregation of the provided GLM ncl just to avoid to re-predict the Emblem model at the PC2 level. If the results are satisfactory.

3.1 The different aggregation levels

First let’s visualize the difference between the two levels of granularity PC4 and PC3. In the following maps, the fill is done using the PC4 or PC3 code. No quantitative meaning, just differentiation of the areas.

## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

3.2 Residuals at the different zipcode levels

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs

and the exposure at the different levels:

3.3 Smoothing the residuals at the PC3 level

## OGR data source with driver: ESRI Shapefile 
## Source: "D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp", layer: "Nlp4_r14"
## with 4044 features
## It has 19 fields
## Reading layer `Nlp4_r14' from data source `D:\Users\EUDZ040\R\005_zoning_allsecur\Zoning-git\Shape\Nlp4_r14.shp' using driver `ESRI Shapefile'
## Simple feature collection with 4044 features and 19 fields
## geometry type:  MULTIPOLYGON
## dimension:      XY
## bbox:           xmin: 13565.37 ymin: 306838.8 xmax: 278026.1 ymax: 618873.9
## epsg (SRID):    NA
## proj4string:    +proj=sterea +lat_0=52.15616055555555 +lon_0=5.38763888888889 +k=0.9999079 +x_0=155000 +y_0=463000 +ellps=bessel +units=m +no_defs
## 
## Family: Beta regression(12.594) 
## Link function: logit 
## 
## Formula:
## geom_res_freq ~ s(PC3CODE, bs = "mrf", k = roughness_param, xt = list(nb = nb))
## 
## Parametric coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.65970    0.02455   -67.6   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##              edf Ref.df Chi.sq p-value    
## s(PC3CODE) 70.52  88.53  563.1  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  -0.475   Deviance explained = 65.9%
## -REML = -958.91  Scale est. = 1         n = 798

3.4 Hierarchical clustering of the smoothed residuals

4 export the zones

fwrite(zones_gmrf[, .(PC4CODE, PC3CODE, PC2CODE, PC1CODE, zone30, zone20, zone10)], “PC4zones_gamgmrf.csv”)

Table continues below
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
33 50 100 39 71 50 64 23 49 34 31 31 19 4 17 17
17 18 19 20 21 22 23 24 25 26 27 28 29 30
13 12 4 14 13 16 7 29 7 10 19 16 5 1
Table continues below
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
33 89 100 102 50 64 57 49 44 19 5 24 17 16 14
16 17 18 19 20
32 26 7 29 21

Probably too many zones.

1 2 3 4 5 6 7 8 9 10
57 189 216 106 61 24 30 39 26 50

The zones in qualitative colours

The zones in sequential colours